A real-world example

We're going to take a look at data on COVID coverage by local news agencies in the US.

There are a number of features in this dataset; a more detailed description of most of them is available here. I'll describe the ones we're using as we go, obviously.

Note: I am, again, using data that I have used for research in the past. I think it is useful, because I am able to provide more thoughtful responses to questions and comments about the data and methods. It is not because I think my research is especially great, or that you should read it.

Walking the ML Pipeline

Real world goal

Find local news deserts. In other words, find places where people aren't likely to be getting adequate news about COVID.

Real world mechanism

Identify places where the number of articles that cover COVID is low. We can then try to forecast coverage in locations where we don't have data.

Learning problem

Predict the percentage of weekly coverage devoted to COVID for local news outlets across the country that make data available

Data Collection

Data Representation

Below, we display the first five rows of the data...
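A minimal sketch of what that looks like with pandas (the file name here is a placeholder, not the actual course data file):

```python
import pandas as pd

# Placeholder file name, not the actual course data file
df = pd.read_csv("covid_local_news_coverage.csv")

# Peek at the first five rows to get a sense of the features
df.head()
```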

In general, we would also clean up our data for analysis at this point - rescaling features, dropping features, one-hot encoding, and so on. For the purposes of this notebook, though, since we're going to do some exploration first, I'm going to leave that to the modeling section.
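For reference, a rough sketch of the kind of cleaning steps this could involve, continuing on the `df` from the sketch above; all column names below are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# All column names here are hypothetical, for illustration only
df = df.drop(columns=["outlet_url"])            # drop features we won't use
df = pd.get_dummies(df, columns=["state"])      # one-hot encode a categorical feature
num_cols = ["population", "median_income"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])  # rescale numeric features
```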

Target Class/Model

We're going to stick with linear regression for now, though we can choose to add non-linear features later. Now we know what linear regression is! And how to optimize it (although we'll use the sklearn implementation).
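A minimal sketch of setting up that model with sklearn:

```python
from sklearn.linear_model import LinearRegression

# Plain linear regression; sklearn handles the optimization for us.
# If we want non-linear effects later, we can transform the features
# (e.g., with PolynomialFeatures) and keep the same model class.
model = LinearRegression()
```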

Training dataset

Picking a "good" training dataset ... getting us started

Next week, we'll cover model evaluation in more detail. For now, we're just going to note that to ensure our model is generalizable and not overfit to the training data, we need to separate out a training dataset from a test dataset.
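A sketch of that split with sklearn's train_test_split; the feature and target column names here are placeholders:

```python
from sklearn.model_selection import train_test_split

# Placeholder feature and target column names
X = df[["population", "median_income", "week"]]
y = df["pct_covid_coverage"]

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```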

Exercise: Give a high-level argument for why evaluating on the training data is a bad idea

With temporal data (and, more generally, data with dependencies), it is also important to make sure we avoid leaking information between the training and test data in a way that gives us a biased picture of how well we are predicting. Leakage can happen in at least two ways; again, we'll go into more detail next week.

For this simple example, we're going to ignore the leakage issue for now. We'll come back and fix that next week.
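For reference, though, one simple way to respect time in the split is to train on earlier weeks and test on later ones. A rough sketch, assuming a `week` column (the 80% cutoff is arbitrary):

```python
# Train on earlier weeks, test on later weeks, instead of splitting at random.
# Assumes an orderable `week` column; the 80% cutoff is arbitrary.
cutoff = df["week"].quantile(0.8)
train_df = df[df["week"] <= cutoff]
test_df = df[df["week"] > cutoff]
```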

Picking ... a dataset

Neat! But to pick a good training dataset, we first need to know ... what our dataset is. This data has a lot of features. In class, we'll play with a bunch of them together. Here, I'm just going to get us started.

Model training

OK, let's have at it!
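Continuing the sketch from the training-dataset section above (X_train, y_train), fitting the model is a one-liner:

```python
# Fit the linear regression on the training split
model.fit(X_train, y_train)

# Inspect the learned parameters
print(model.intercept_, model.coef_)
```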

Predict on Test Data
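And, continuing the same sketch, predictions on the held-out data:

```python
# Predictions for the held-out outlet-weeks
y_pred = model.predict(X_test)
```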

Evaluate error
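A few standard regression metrics from sklearn, again continuing the sketch (y_test and y_pred from above):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```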

Deploy (?)

What might we be asking ourselves before we deploy? What might we try to change? Let's work on it!

In class exercise ... beat Kenny's predictive model!

The missing steps in the pipeline

The above pipeline is a barebones representation. In reality, we try a bunch of different models, data representations, and even questions before we deploy. Here's a representation of the pipeline that comes closer to that reality, from this tweet.

Let's look at how we might do some of these other steps

Exploratory Data Analysis (EDA)

Basic plotting and stats
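For instance, something along these lines with pandas and matplotlib (the target column name is a placeholder):

```python
import matplotlib.pyplot as plt

# Summary statistics for the numeric features
print(df.describe())

# Distribution of the target (placeholder column name)
df["pct_covid_coverage"].hist(bins=30)
plt.xlabel("Weekly % of coverage devoted to COVID")
plt.ylabel("Number of outlet-weeks")
plt.show()
```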

Using a simple model

One other form of EDA that I personally often leverage is to fit a simple (usually linear) model to the data and explore the coefficients for indicators of what effects might exist.
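One way to do this is with statsmodels, which reports coefficients alongside standard errors; a sketch, reusing the hypothetical X_train and y_train from earlier:

```python
import statsmodels.api as sm

# Fit a plain OLS model and read the coefficient table as a rough signal of
# which features might matter (directions and relative sizes, not causal claims)
X_eda = sm.add_constant(X_train)
ols = sm.OLS(y_train, X_eda).fit()
print(ols.summary())
```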

As we have now discussed in class, though, we have to be really careful when interpreting coefficients for models with transformed predictor variables. Here, for example, is a useful resource for your programming assignment.

In our case, we actually ended up using a transformed variable, which means our coefficients need to be interpreted the way coefficients in logistic regression are. Here is a good explanation. We will cover this in more detail next week, but there's a simple plot below to discuss!

Exercise: Are the estimates from the coefficients we used comparable? Which are, and which are not? What might we do to make them even more comparable?

For this demo, I took code from this sklearn tutorial. The tutorial is very nice and I would highly recommend going through it, although I will teach most of what is in it over the next week or two in one way or another.

Validation Demos

Some of the other things we've discussed in class don't fit neatly within a real-world ML pipeline, but they are useful to show on this dataset... they are collected in the section below.

k-Fold Cross-validation
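A sketch with sklearn's cross_val_score, reusing the hypothetical training split from earlier:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# 5-fold cross-validation; sklearn reports negative MSE, so we flip the sign
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_train, y_train,
                         cv=cv, scoring="neg_mean_squared_error")
print("Mean MSE across folds:", -scores.mean())
```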

Learning Curves (And, an intro to pipelines)
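A sketch of both ideas together: a Pipeline, so preprocessing is re-fit inside each fold, and learning_curve, to see how error changes with training-set size:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Putting the scaler inside a Pipeline means it is re-fit on each training fold,
# rather than peeking at the validation fold
pipe = make_pipeline(StandardScaler(), LinearRegression())

sizes, train_scores, val_scores = learning_curve(
    pipe, X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error")

plt.plot(sizes, -train_scores.mean(axis=1), label="training error")
plt.plot(sizes, -val_scores.mean(axis=1), label="validation error")
plt.xlabel("Number of training examples")
plt.ylabel("MSE")
plt.legend()
plt.show()
```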

kNN Regression and validation curves
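A sketch of a validation curve over the number of neighbors for a kNN regressor, again on the hypothetical training split:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import validation_curve

# Sweep k and compare training vs. validation error at each value
k_values = np.arange(1, 31)
train_scores, val_scores = validation_curve(
    KNeighborsRegressor(), X_train, y_train,
    param_name="n_neighbors", param_range=k_values,
    cv=5, scoring="neg_mean_squared_error")

plt.plot(k_values, -train_scores.mean(axis=1), label="training error")
plt.plot(k_values, -val_scores.mean(axis=1), label="validation error")
plt.xlabel("n_neighbors (k)")
plt.ylabel("MSE")
plt.legend()
plt.show()
```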

Yikes! Compare that to the standard sklearn example.

What gives?

A Full Model Selection Pipeline (Nested CV + Two-way Holdout)

!!!!!!!! Disclaimer: this is a modified version of Sebastian Raschka's demo [here](https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L11/code/11-eval4-algo__nested-cv_verbose1.ipynb) !!!!!
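In the same spirit, but much abbreviated, here is a sketch of nested cross-validation with sklearn: an inner loop to choose hyperparameters and an outer loop to estimate generalization error (the estimator and grid are illustrative, not Raschka's exact setup):

```python
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Inner loop: choose n_neighbors; outer loop: estimate generalization error
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

gs = GridSearchCV(KNeighborsRegressor(),
                  param_grid={"n_neighbors": [1, 3, 5, 10, 20]},
                  cv=inner_cv, scoring="neg_mean_squared_error")

nested_scores = cross_val_score(gs, X_train, y_train,
                                cv=outer_cv, scoring="neg_mean_squared_error")
print("Nested CV MSE:", -nested_scores.mean())
```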

Lasso Example
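A sketch using LassoCV, which picks the regularization strength by cross-validation; scaling comes first so the penalty treats all features comparably:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Scale, then fit a lasso whose alpha is chosen by 5-fold cross-validation
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso.fit(X_train, y_train)

# Chosen regularization strength and the (possibly sparse) coefficients
print(lasso.named_steps["lassocv"].alpha_)
print(lasso.named_steps["lassocv"].coef_)
```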

GAM Example
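One option for fitting GAMs in Python is the pygam package; a minimal sketch, assuming X_train has three numeric feature columns:

```python
from pygam import LinearGAM, s

# One smooth term per feature; assumes three numeric columns in X_train
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X_train, y_train)
gam.summary()
```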